FSAs and Regular Expressions

A Finite State Automaton (FSA) is a mathematical model of computation comprising all four of the following:

  1. a finite number of states, of which exactly one is active at any given time;

  2. transition rules to change the active state;

  3. an initial state; and

  4. one or more final states.

We can draw an FSA by representing each state as a circle, each final state as a double circle, the start state as the only state with an incoming arrow that does not come from another state, and the transition rules as labeled edges connecting the states.

When labels are assigned to states, they appear inside the circle representing the state.

In this category, FSAs will be limited to parsing strings, that is, determining whether a string is valid or not.

Basics

Here is a drawing of an FSA that is used to parse strings consisting of x's and y's:

Fsa.svg

In the above FSA, there are three states: A, B, and C.

The initial state is A; the final state is C.

The only way to go from state A to B is by seeing the letter x. Once in state B, there are two transition rules: seeing the letter y will cause the FSA to make C the active state, and seeing an x will keep B as the active state.

State C is a final state, so if the FSA is in state C when the entire string has been read, the input string is said to be accepted by the FSA.

In State C, seeing any additional letter y will keep the machine in state C.

The FSA above will accept strings composed of one or more x’s followed by one or more y’s (e.g., xy, xxy, xxxyy, xyyy, xxyyyy).
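The behavior described above can be sketched as a small table-driven simulation. The states A, B, and C and their transitions come from the text; the dictionary layout and the names `TRANSITIONS`, `START`, `FINAL`, and `accepts` are illustrative assumptions, not part of the original material.

```python
# Transition rules of the three-state FSA described above.
# Keys are (current state, input letter); values are the next state.
TRANSITIONS = {
    ("A", "x"): "B",
    ("B", "x"): "B",
    ("B", "y"): "C",
    ("C", "y"): "C",
}
START = "A"      # initial state
FINAL = {"C"}    # set of final states


def accepts(string):
    """Return True if the FSA is in a final state after reading the whole string."""
    state = START
    for symbol in string:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no transition rule applies, so the string is rejected
        state = TRANSITIONS[key]
    return state in FINAL


if __name__ == "__main__":
    for s in ["xy", "xxy", "xxxyy", "xyyy", "xxyyyy", "x", "yx", "xyx", ""]:
        print(repr(s), accepts(s))
```

Treating a missing transition as immediate rejection is one common convention; an equivalent design adds an explicit non-final "dead" state that every undefined transition leads to.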

Regular Expression (RE)

A Regular Expression (RE) is an algebraic representation of an FSA.

For example, the regular expression corresponding to the first FSA given above is xx*yy*.

The rules for forming a Regular Expression (RE) are as follows:

  1. The null string (λ) is a RE.
  2. If the string a is in the input alphabet, then it is a RE.
  3. If a and b are both REs, then so are the strings built up using the following rules:
    1. CONCATENATION. "ab" (a followed by b).
    2. UNION. "aUb" or "a|b" (a or b).
    3. CLOSURE. "a*" (a repeated zero or more times). This is known as the Kleene Star.

The order of precedence for Regular Expression operators is: Kleene Star, concatenation, and then union. Similar to standard Algebra, parentheses can be used to group sub-expressions.

For example, "dca*b" generates strings dcb, dcab, dcaab, and so on, whereas "d(ca)*b" generates strings db, dcab, dcacab, dcacacab, and so on.
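The effect of precedence and grouping can be checked with Python's standard `re` module, which uses the same meaning for `*` and parentheses. This is a minimal sketch; the helper name `generates` and the list of test strings are assumptions chosen for illustration. `re.fullmatch` is used so that the whole string must match the expression.

```python
import re


def generates(pattern, string):
    """Return True if the entire string matches the regular expression."""
    return re.fullmatch(pattern, string) is not None


if __name__ == "__main__":
    candidates = ["db", "dcb", "dcab", "dcaab", "dcacab"]
    for pattern in ["dca*b", "d(ca)*b"]:
        matched = [s for s in candidates if generates(pattern, s)]
        print(pattern, "matches", matched)
```

In "dca*b" the star applies only to the letter a, while in "d(ca)*b" it applies to the whole group ca, which is why the two expressions agree on dcab but disagree on the other candidates.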

If we have a Regular Expression, then we can mechanically build an FSA to accept the strings which are generated by the Regular Expression.

Conversely, if we have an FSA, we can mechanically develop a Regular Expression which will describe the strings which can be parsed by the FSA.

For a given FSA or Regular Expression, there are many others which are equivalent to it.

A "most simplified" Regular Expression or FSA is not always well defined.

Regular Expression Identities
1. (a*)* = a*
2. aa* = a*a
3. aa* U λ = a*
4. a(b U c) = ab U ac
5. a(ba)* = (ab)*a
6. (a U b)* = (a* U b*)*
7. (a U b)* = (a*b*)*
8. (a U b)* = a*(ba*)*
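One way to build confidence in these identities is to compare both sides on every short string over a small alphabet. The sketch below does this with Python's `re` module, writing U as `|` and λ as an empty alternative. The names `IDENTITIES` and `same_language`, the alphabet "abc", and the length bound of 5 are assumptions made for illustration. A bounded check like this is supporting evidence only, not a proof.

```python
import re
from itertools import product

# Each pair is (left side, right side) of one identity, in Python regex syntax.
IDENTITIES = [
    ("(a*)*", "a*"),
    ("aa*", "a*a"),
    ("aa*|", "a*"),          # the empty alternative plays the role of λ
    ("a(b|c)", "ab|ac"),
    ("a(ba)*", "(ab)*a"),
    ("(a|b)*", "(a*|b*)*"),
    ("(a|b)*", "(a*b*)*"),
    ("(a|b)*", "a*(ba*)*"),
]


def same_language(left, right, alphabet="abc", max_len=5):
    """Return True if both patterns agree on every string up to max_len letters."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            s = "".join(letters)
            in_left = re.fullmatch(left, s) is not None
            in_right = re.fullmatch(right, s) is not None
            if in_left != in_right:
                return False
    return True


if __name__ == "__main__":
    for number, (left, right) in enumerate(IDENTITIES, start=1):
        print(number, left, "=", right, same_language(left, right))
```

The same helper can expose non-identities: for example, a*b* and (a U b)* differ on the string ba.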

RegEx in Practice

Programmers use Regular Expressions (usually referred to as regex) extensively for expressing patterns to search for. All modern programming languages have regular expression libraries.

Unfortunately, the specific syntax rules vary depending on the specific implementation, programming language, or library in use.

Interactive websites for testing regexes are a useful resource for learning regexes by experimentation.

An excellent online tool is https://regex101.com/.

A very nice exposition is Pattern Matching with Regular Expressions from the Automate the Boring Stuff book and online course.

Here are the additional syntax rules that we will use. They are supported by nearly all regex packages.

| Pattern | Description |
| ------- | ----------- |
| \| | As described above, a vertical bar separates alternatives. For example, gray\|grey can match "gray" or "grey". |
| * | As described above, the asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on. |
| ? | The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour". |
| + | The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
| . | The wildcard . matches any character. For example, a.b matches any string that contains an "a", then any other character, and then a "b" such as "a7b", "a&b", or "arb", but not "abbb". Therefore, a.*b matches any string that contains an "a" and a "b" with 0 or more characters in between. This includes "ab", "acb", or "a123456789b". |
| [ ] | A bracket expression matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. |
| [^ ] | Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed. |
| ( ) | As described above, parentheses define a sub-expression. For example, the pattern H(ä\|ae?)ndel matches "Handel", "Händel", and "Haendel". |
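The examples in the table can be tried out with Python's `re` module. This is a minimal sketch: the `EXAMPLES` list and the `matches` helper are assumptions introduced here, and the rejected strings not named in the table (such as "griy" or "colouur") are made up for illustration. `re.fullmatch` is used so that a pattern must account for the whole string, which is the sense in which "abbb" does not match a.b.

```python
import re

# (pattern, strings that should match, strings that should not match)
EXAMPLES = [
    ("gray|grey", ["gray", "grey"], ["griy"]),
    ("ab*c", ["ac", "abc", "abbc", "abbbc"], ["abd"]),
    ("colou?r", ["color", "colour"], ["colouur"]),
    ("ab+c", ["abc", "abbc", "abbbc"], ["ac"]),
    ("a.b", ["a7b", "a&b", "arb"], ["abbb"]),
    ("a.*b", ["ab", "acb", "a123456789b"], ["a"]),
    ("[abcx-z]", ["a", "b", "c", "x", "y", "z"], ["d"]),
    ("[^abc]", ["d", "7"], ["a", "b", "c"]),
    ("H(ä|ae?)ndel", ["Handel", "Händel", "Haendel"], ["Hndel"]),
]


def matches(pattern, string):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, string) is not None


if __name__ == "__main__":
    for pattern, accepted, rejected in EXAMPLES:
        for s in accepted + rejected:
            print(pattern, repr(s), matches(pattern, s))
```

Other regex libraries may differ in details such as how a whole-string match is requested, so the calls above apply to Python specifically.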

Sample Problems

Typical problems in the category will include: translate an FSA to a Regular Expression; simplify a Regular Expression; determine which Regular Expressions or FSAs are equivalent; and determine which strings are accepted by either an FSA or a Regular Expression.

Find a simplified Regular Expression for the following FSA:

Fsa s1.png

Which of the following strings are accepted by the following Regular Expression "(00*1*1)U(11*0*0)" ?

Which of the following strings match the regular expression pattern "[A-D]*[a-d]*[0-9]" ?

Which of the following strings match the regular expression pattern "Hi?g+h+[^a-ceiou]" ?